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USING LINEAR PREDICTORS TO IMPUTE ALLELE 
FREQUENCIES FROM SUMMARY OR POOLED 
GENOTYPE DATA 

By Xiaoquan Wen and Matthew Stephens 1 

University of Chicago 

Recently-developed genotype imputation methods are a powerful 
tool for detecting untyped genetic variants that affect disease suscep- 
tibility in genetic association studies. However, existing imputation 
methods require individual-level genotype data, whereas, in practice, 
it is often the case that only summary data are available. For ex- 
ample, this may occur because, for reasons of privacy or politics, 
only summary data are made available to the research community at 
large; or because only summary data are collected, as in DNA pool- 
ing experiments. In this article we introduce a new statistical method 
that can accurately infer the frequencies of untyped genetic variants 
in these settings, and indeed substantially improve frequency esti- 
mates at typed variants in pooling experiments where observations 
are noisy. Our approach, which predicts each allele frequency using 
a linear combination of observed frequencies, is statistically straight- 
forward, and related to a long history of the use of linear methods 
for estimating missing values (e.g., Kriging). The main statistical 
novelty is our approach to regularizing the covariance matrix esti- 
mates, and the resulting linear predictors, which is based on methods 
from population genetics. We find that, besides being both fast and 
flexible — allowing new problems to be tackled that cannot be han- 
dled by existing imputation approaches purpose-built for the genetic 
context — these linear methods are also very accurate. Indeed, impu- 
tation accuracy using this approach is similar to that obtained by 
state-of-the-art imputation methods that use individual-level data, 
but at a fraction of the computational cost. 

1. Introduction. Genotype imputation [Servin and Stephens (2008), Guan 
and Stephens (2008), Marchini et al. (2007), Howie, Donnelly and Marchini 
(2009), Browning and Browning (2007), Huang et al. (2009)] has recently 
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emerged as a useful tool in the analysis of genetic association studies as a 
way of performing tests of association at genetic variants (specifically, Sin- 
gle Nucleotide Polymorphisms, or SNPs) that were not actually measured 
in the association study. In brief, the idea is to exploit the fact that un- 
typed SNPs are often correlated, in a known way, with one or more typed 
SNPs. Imputation-based approaches exploit these correlations, using ob- 
served genotypes at typed SNPs to estimate, or impute, genotypes at un- 
typed SNPs, and then test for association between the imputed genotypes 
and phenotype, taking account of uncertainty in the imputed genotypes. 
(Although in general statistics applications the term "imputation" may im- 
ply replacing unobserved data with a single point estimate, in the genetic 
context it is often used more broadly to include methods that consider the 
full conditional distribution of the unobserved genotypes, and this is the 
way we use it here.) These approaches have been shown to increase overall 
power to detect associations by expanding the number of genetic variants 
that can be tested for association [Servin and Stephens (2008), Marchini 
et al. (2007)], but perhaps their most important use has been in performing 
meta-analysis of multiple studies that have typed different, but correlated, 
sets of SNPs [e.g., Zeggini et al. (2008)]. 

Existing approaches to imputation in this context have been developed 
to work with individual-level data: given genotype data at typed SNPs in 
each individual, they attempt to impute the genotypes of each individual at 
untyped SNPs. From a general statistical viewpoint, one has a large number 
of correlated discrete- valued random variables (genotypes), whose means 
and covariances can be estimated, and the aim is to predict values of a 
subset of these variables, given observed values of all the other variables. 
Although one could imagine applying off-the-shelf statistical methods to 
this problem [e.g., Yu and Schaid (2007) consider approaches based on linear 
regression] , in practice, the most successful methods in this context have used 
purpose-built methods based on discrete Hidden Markov Models (HMMs) 
that capture ideas from population genetics [e.g., Li and Stephens (2003), 
Scheet and Stephens (2005)]. 

In this paper we consider a related, but different, problem: given the 
frequency of each allele at all typed SNPs, we attempt to impute the fre- 
quency of each allele at each untyped SNP. We have two main motivations 
for considering this problem. The first is that, although most large-scale as- 
sociation studies collect individual- level data, it is often the case that, for 
reasons of privacy [Homer et al. (2008b), Sankararaman et al. (2009)] or 
politics, only the allele frequency data (e.g., in cases vs. controls) are made 
available to the research community at large. The second motivation is an 
experimental design known as DNA pooling [Homer et al. (2008a), Meaburn 
et al. (2006)], in which individual DNA samples are grouped into "pools" 
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and high-throughput genotypings are performed on each pool. This experi- 
mental design can be considerably cheaper than separately genotyping each 
individual, but comes at the cost of providing only (noisy) estimates of the 
allele frequencies in each pool. In this setting the methods described here 
can provide not only estimates of the allele frequencies at untyped SNPs, 
but also more accurate estimates of the allele frequencies at typed SNPs. 

From a general statistical viewpoint, this problem of imputing frequencies 
is not so different from imputing individual genotypes: essentially, it simply 
involves moving from discrete-valued variables to continuous ones. However, 
this change to continuous variables precludes direct use of the discrete HMM- 
based methods that have been applied so successfully to impute individual 
genotypes. The methods we describe here come from our attempts to extend 
and modify these HMM-based approaches to deal with continuous data. In 
doing so, we end up with a considerably simplified method that might be 
considered an off-the-shelf statistical approach: in essence, we model the al- 
lele frequencies using a multivariate normal distribution, which results in 
unobserved frequencies being imputed using linear combinations of the ob- 
served frequencies (as in Kriging, for example). Some connection with the 
HMM-based approaches remains though, in how we estimate the mean and 
variance-covariance matrix of the allele frequencies. In particular, consider- 
ation of the HMM-based approaches lead to a natural way to regularize the 
estimated variance-covariance matrix, making it sparse and banded: some- 
thing that is important here for both computational and statistical reasons. 
The resulting methods are highly computationally efficient, and can easily 
handle very large panels (phased or unphased). They are also surprisingly 
accurate, giving estimated allele frequencies that are similar in accuracy to 
those obtained from state-of-the-art HMM-based methods applied to indi- 
vidual genotype data. That is, one can estimate allele frequencies at untyped 
SNPs almost as accurately using only the frequency data at typed SNPs as 
using the individual data at typed SNPs. Furthermore, when individual-level 
data are available, one can also apply our method to imputation of individ- 
ual genotypes (effectively by treating each individual as a pool of 1), and 
this results in imputation accuracy very similar to that of state-of-the-art 
HMM-based methods, at a fraction of the computational cost. Finally, in 
the context of noisy data from pooling experiments, we show via simulation 
that the method can produce substantially more accurate estimated allele 
frequencies at genotyped markers. 

2. Methods and models. We begin by introducing some terminology and 
notation. 

A SNP (Single Nucleotide Polymorphism) is a genetic marker that usually 
exhibits two different types (alleles) in a population. We will use and 1 
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to denote the two alleles at each SNP, with the labeling being essentially 
arbitrary. 

A haplotype is a combination of alleles at multiple SNPs residing on a 
single copy of a genome. Each haplotype can be represented by a string of 
binary (0/1) values. Each individual has two haplotypes, one inherited from 
each parent. Routine technologies read the two haplotypes simultaneously 
to produce a measurement of the individual's genotype at each SNP, which 
can be coded as 0, 1 or 2 copies of the 1 allele. Thus, the haplotypes them- 
selves are not usually directly observed, although they can be inferred using 
statistical methods [Stephens, Smith and Donnelly (2001)]. Genotype data 
where the haplotypes are treated as unknown are referred to as "unphased," 
whereas if the haplotypes are measured or estimated, they are referred to as 
"phased." 

In this paper we consider the following form of imputation problem. We 
assume that data are available on p SNPs in a reference panel of data on 
m individuals samples from a population, and that a subset of these SNPs 
are typed on a further study sample of individuals taken from a similar 
population. The goal is to estimate data at untyped SNPs in the study 
sample, using the information on the correlations among typed and untyped 
SNPs that is contained in the reference panel data. 

We let M denote the panel data, and hi , . . . , h.2 n denote the 2n haplo- 
types in the study sample. In the simplest case, the panel data will be a 
2m x p binary matrix, and the haplotypes hi, ... , h2 n can be thought of as 
additional rows of this matrix with some missing entries whose values need 
"imputing." Several papers have focused on this problem of "individual- 
level" imputation [Servin and Stephens (2008), Scheet and Stephens (2005), 
Marchini et al. (2007), Browning and Browning (2007), Li, Ding and Abeca- 
sis (2006)]. In this paper we consider the problem of performing imputation 
when only summary- level data are available for hi, ... , h.2 n . Specifically, let 



denote the vector of allele frequencies in the study sample. We assume that 
these allele frequencies are measured at a subset of typed SNPs, and consider 
the problem of using these measurements, together with information in M, 
to estimate the allele frequencies at the remaining untyped SNPs. More 
formally, if (yt,y u ) denotes the partition of y into typed and untyped SNPs, 
our aim is to estimate the conditional distribution y n |yt,M. 

Our approach is based on the assumption that hi,...,h2 n are indepen- 
dent and identically distributed (i.i.d.) draws from some conditional dis- 
tribution Pr(h|M). Specifically, in common with many existing approaches 
to individual- level imputation [Stephens and Scheet (2005), Marchini et al. 
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(2007), Li, Ding and Abecasis (2006)], we use the HMM-based conditional 
distribution from Li and Stephens (2003), although other choices could be 
considered. It then follows by the central limit theorem, provided that the 
sample size 2n is large, the distribution of y|M can be approximated by a 
multivariate normal distribution: 

(2.2) y|M~N p (/i,E), 

where /i = E(h|M) and E = i Var(h|M). 

From this joint distribution, the required conditional distribution is eas- 
ily obtained. Specifically, by partitioning fi and E in the same way as y, 
according to SNPs' typed/untyped status, (2.2) can be written 



(2.3) 
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and 

(2.4) y u |yt,M ~ N 9 (/2 U + E^Vt " A*t), £«« - E^E^E^). 

The mean of this last distribution can be used as a point estimate for the 
unobserved frequencies y u , while the variance gives an indication of the 
uncertainty in these estimates. (In principle, the mean can lie outside the 
range [0, 1), in which case we use or 1, as appropriate, as the point estimate; 
however, this happens very rarely in practice.) 

The parameters n and E must be estimated from the panel data. It may 
seem natural to estimate these using the empirical mean fP anel and the 
empirical covariance matrix £ panel from the panel. However, £ panel is highly 
rank deficient because the sample size m in the panel is far less than the 
number of SNPs p, and so this empirical matrix cannot be used directly. Use 
of the conditional distribution from Li and Stephens solves this problem. 
Indeed, under this conditional distribution E(h|M) = ft and Var(h|M) = E 
can be derived analytically (Appendix A) as 

(2.5) A = (l-#)f pand + ^l, 

(2.6) t = (i-e) 2 s+ d -(i- e -y, 

where 9 is a parameter relating to mutation, and S is obtained from S panel 
by shrinking off-diagonal entries toward 0. Specifically, 
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Empirical Estimate Shrinkage Estimate 



•t <- 



FlG. 1. Comparison of empirical and shrinkage estimates (based on Li and Stephens 
Model) of squared correlation matrix from the panel. Both of them are estimated using 
Hapmap CEU panel with 120 haplotypes. The region plotted is on chromosome 22 and 
contains 1000 Affymetrix SNPs which cover a 15 Mb genomic region. Squared correlation 
values in [0.05,1.00] are displayed using R's heat. colors scheme, with gold color represent- 
ing stronger correlation and red color representing weaker correlation. 




where pij is an estimate of the population-scaled recombination rate be- 
tween SNPs i and j [e.g., Hudson (2001), Li and Stephens (2003), McVean, 
Awadalla and Fearnhead (2002)]. We use the value of 6 suggested by Li and 
Stephens (2003), 

M o\ a _ (Zli=l Vj) 

^ ' O , ^2m-l, /.x_i> 

and values of obtained by applying the software PHASE [Stephens and 
Scheet (2005)] to the HapMap CEU data, which are conveniently distributed 
with the IMPUTE software package. For SNPs i and j that are distant, 
exp(— Of-) ~ 0. To exploit the benefits of sparsity, we set any value that was 
less than 10 -8 to be 0, which makes S sparse and banded: see Figure 1 for 
illustration. This makes matrix inversion in (2.4) computationally feasible 
and fast, using standard Gaussian elimination. 

2.1. Incorporating measurement error and overdispersion. Our treatment 
above assumes that the allele frequencies of typed SNPs, yt, are observed 
without error. In some settings, for example, in DNA pooling experiments, 
this is not the case. We incorporate measurement error by introducing a 
single parameter e 2 , and assume 

(2.9) y t obs |yr c ~N p ^(yr°,e 2 /), 

where random vectors y° bs and y* ruc represent the observed and true sample 
allele frequencies for typed SNPs respectively, and subscript p — q denotes 
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the number of typed SNPs. We assume that, given y^ e , the observations 
y° bs are conditionally independent of the panel data (M) and the allele 
frequencies at untyped SNPs (y„ rue ). 

Our treatment in the previous section also makes several other implicit 
assumptions: for example, that the panel and study individuals are sampled 
from the same population, and that the parameters p and 9 are estimated 
without error. Deviations from these assumptions will cause overdispersion: 
the true allele frequencies will lie further from their expected values than 
the model predicts. To allow for this, we modify (2.2) by introducing an 
overdispersion parameter a 2 : 

(2.10) y tmc |M~N p (/i,a 2 S). 

Overdispersion models like this are widely used for modeling binomial data 
[McCullagh and Nelder (1989)]. 

In our applications below, for settings involving measurement error (i.e., 
DNA pooling experiments), we estimate a 2 , e 2 by maximizing the multi- 
variate normal likelihood: 

(2.11) y° bs |M ~ N p _ 9 (A t , a 2 ±tt + e 2 I). 

For settings without measurement error, we set e 2 = and estimate a 2 by 
maximum likelihood. 

From the hierarchical model defined by (2.9) and (2.10), the conditional 
distributions of allele frequencies at untyped and typed SNPs are given by 



y« |y° >m ~ n, ( a„ + s ui ( s tt + - 2 i j ( y p - £ t ) 



i e 

(2.12) 
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We use (2.12) to impute allele frequencies at untyped SNPs. In particular, 
we use the conditional mean 



(2.i4) yir=A u +Eu t ( i)tt + V) (yt obs -A*) 
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as a natural point estimate for these allele frequencies. In settings involving 
measurement error, we use (2.13) to estimate allele frequencies at typed 
SNPs, again using the mean 

( 2 -!5) = {^u + ±i) 1 (^At + ^y? bs ) 

as a point estimate. Note that this mean has an intuitive interpretation 
as a weighted average of the observed allele frequency at that SNP and 
information from other nearby, correlated, SNPs. For example, if two SNPs 
are perfectly correlated, then in the presence of measurement error, the 
average of the measured frequencies will be a better estimator of the true 
frequency than either of the single measurements (assuming measurement 
errors are uncorrelated) . The lower the measurement error, e 2 , the greater 
the weight given to the observed frequencies; and when e 2 = the estimated 
frequencies are just the observed frequencies. 

Remark. For both untyped and typed SNPs, our point estimates for 
allele frequencies, (2.14) and (2.15), are linear functions of the observed allele 
frequencies. Although these linear predictors were developed based on an 
appeal to the Central Limit Theorem, and resultant normality assumption, 
there are alternative justifications for use of these particular linear functions 
that do not rely on normality. Specifically, assuming that the two haplotypes 
making up each individual are i.i.d. draws from a conditional distribution 
Pr(h|M) with mean fi and variance-covariance matrix cr 2 £, then the linear 
predictors (2.14) and (2.15) minimize the integrated risk (assuming squared 
error loss) among all linear predictors [West and Harrison (1997)]. In this 
sense they are the best linear predictors, and so we refer to this method of 
imputation as Best Linear IMPutation or BLIMP. 

2.2. Extension to imputing genotype frequencies. The development above 
considers imputing unobserved allele frequencies. In some settings one might 
also want to impute genotype frequencies. A simple way to do this is to use 
an assumption of Hardy- Weinberg equilibrium: that is, to assume that if y 
is the allele frequency at the untyped SNP, then the three genotypes have 
frequencies (1 — y) 2 ,2y(l — y) and y 2 . Under our normal model, the expected 
values of these three quantities can be computed: 

E((l-y) 2 |y t ° bs ,M) = (l-E(y|y° bs ,M)) 2 + Var(y|y t obs ,M), 

(2.16) E(y 2 |y° bs ,M) = (E(y| y ° bs ,M)) 2 +Var(y| y ° bs ,M), 

E(2y(l - y)\y? S ,M) = 1 - E((l - y) 2 \y? S ,M) - E(y 2 |yf S ,M), 

where E(y|yf s ,M) and Var(y|y° bs , M) are given in (2.12). These expecta- 
tions can be used as estimates of the unobserved genotype frequencies. 
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The method above uses only allele frequency data at typed SNPs. If data 
are also available on genotype frequencies, as might be the case if the data 
are summary data from a regular genome scan in which all individuals were 
individually genotypes, then an alternative approach that does not assume 
HWE is possible. In brief, we write the unobserved genotype frequencies as 
means of the genotype indicators, lu=o] an d l[g=2] [analogous to expres- 
sion (2.1)], and then derive expressions for the means and covariances of 
these indicators both within SNPs and across SNPs. Imputation can then 
be performed by computing the appropriate conditional distributions using 
the joint normal assumption, as in (2.4). See Appendix B for more details. 

In practice, we have found these two methods give similar average accu- 
racy (results not shown), although this could be because our data conform 
well to Hardy-Weinberg equilibrium. 

2.3. Individual-level genotype imputation. Although we developed the 
above model to tackle the imputation problem when individual genotypes 
are not available, it can also be applied to the problem of individual-level 
genotype imputation when individual-level data are available, by treating 
each individual as a pool of two haplotypes (application of these methods 
to small pool sizes is justified by the Remark above). For example, doubling 
(2.14) provides a natural estimate of the posterior mean genotype for an un- 
typed SNP. For many applications this posterior mean genotype may suffice; 
see Guan and Stephens (2008) for the use of such posterior means in down- 
stream association analyses. If an estimate that takes a value in {0, 1,2} is 
desired, then a simple ad hoc procedure that we have found works well in 
practice is to round the posterior mean to the nearest integer. Alternatively, 
a full posterior distribution on the three possible genotypes can be computed 
by using the genotypic version of our approach (Appendix B). 

2.4. Using unphased genotype panel data. Our method can be readily 
adapted to settings where the panel data are unphased. To do this, we note 
that the estimates (2.5), (2.6) for fi and S depend on the panel data only 
through the empirical mean and variance-covariance matrix of the panel 
haplotypes. When the panel data are unphased, we simply replace these 
with 0.5 times the empirical mean and variance-covariance matrix of the 
panel genotypes [since, assuming random mating, genotypes are expected 
to have twice the mean and twice the (co)variance of haplotypes]; see Weir 
(1979) for related discussion. 

2.5. Imputation without a panel. In some settings, it may be desired to 
impute missing genotypes in a sample where no individuals are typed at 
all SNPs (i.e., there is no panel M), and each individual is typed at a dif- 
ferent subset of SNPs. For example, this may arise if many individuals are 
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sequenced at low coverage, as in the currently-ongoing 1000 genomes project 
(http://www.1000genomes.org). In the absence of a panel, we cannot di- 
rectly obtain the mean and variance-covariance estimates fi and £ as in 
(2.5) and (2.6). An alternative way to obtain these estimates is to treat 
each individual genotype vector as a random sample from multivariate nor- 
mal distribution N p (/i, £), and apply the ECM algorithm [Meng and Rubin 
(1993)] to perform maximum likelihood estimation. However, this approach 
does not incorporate shrinkage. We therefore modify the algorithm in an 
ad-hoc way to incorporate shrinkage in the conditional maximization step. 
See Appendix C for details. 

3. Data application and results. We evaluate methods described above 
by applying them to data from a subset of the WTCCC Birth Cohort, con- 
sisting of 1376 unrelated British individuals genotyped on the Affymetrix 
500K platform [Wellcome Trust Case Control Consortium (2007)]. For demon- 
stration purposes, we use only the 4329 SNPs from chromosome 22. We 
impute data at these SNPs using the 60 unrelated HapMap CEU parents 
[The International HapMap Consortium (2005)] as the panel. For the re- 
combination parameters required in (2.7), we use the estimates distributed 
in the software package IMPUTE vl [Marchini et al. (2007)], which were es- 
timated from the same panel using the software package PHASE [Stephens 
and Scheet (2005)]. 

In our evaluations we consider three types of applications: frequency impu- 
tation using summary-level data, individual-level genotype imputation and 
noise reduction in DNA pooling experiments. We examine both the accuracy 
of point estimates and calibration of the credible intervals. 

3.1. Frequency imputation using summary-level data. In this section we 
evaluate the performance of (2.14) for imputing frequencies at untyped 
SNPs. The observed data consist of the marginal allele frequencies at each 
SNP, which we compute from the WTCCC individual-level genotype data. 
To assess imputation accuracy, we perform the following cross-validation 
procedure: we mask the observed data at every 25th SNP, then treat the 
remaining SNPs as typed and use them to impute the frequencies of masked 
SNPs and compare the imputation results with the actual observed frequen- 
cies. We repeat this procedure 25 times by shifting the position of the first 
masked SNP. Because in this case the observed frequencies are obtained 
through high quality individual-level genotype data, we assume the experi- 
mental error parameter e 2 = 0. 

To provide a basis for comparison, we also perform the same experiment 
using the software package IMPUTE vl [Marchini et al. (2007)], which is 
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among the most accurate of existing methods for this problem. IMPUTE re- 
quires individual-level genotype data and outputs posterior genotype prob- 
abilities for each unmeasured genotype. We therefore input the individual- 
level genotype data to IMPUTE and estimate the allele frequency at each 
untyped SNP using the posterior expected frequency computed from the 
posterior genotype probabilities. Like our method, IMPUTE performs im- 
putation using the conditional distribution from Li and Stephens [Li and 
Stephens (2003)]; however, it uses the full conditional distribution, whereas 
our method uses an approximation based on the first two moments. Further- 
more, IMPUTE uses individual-level genotype data. For both these reasons, 
we would expect IMPUTE to be more accurate than our method, and our 
aim is to assess how much we lose in accuracy by our approximation and by 
using summary-level data. 



To assess accuracy of estimated allele frequencies, we use the Root Mean 
Squared Error (RMSE), 



where J is the number of SNPs tested (4329) and yj,yj are observed and 
imputed allele frequencies for SNP j respectively. 

The RMSE from our method was 0.0157 compared with 0.0154 from IM- 
PUTE (Table 1). Thus, for these data, using only summary-level data sac- 
rifices virtually nothing in accuracy of frequency imputation. Furthermore, 
we found that using an unphased panel (replacing the phased HapMap CEU 
haplotypes with the corresponding unphased genotypes) resulted in only a 
very small decrease in imputation accuracy: RMSE = 0.0159. In all cases 
the methods are substantially more accurate than a "naive method" that 
simply estimates the sample frequency using the panel frequency (Table 1). 

We also investigated the calibration of the estimated variances of the 
imputed frequencies from (2.12). To do this, we constructed a Z-static for 
each test SNP j, 



where yj is the true observed frequency, and the conditional mean and vari- 
ance are as in (2.12). If the variances are well calibrated, the Z-scores should 
follow a standard normal distribution (with slight dependence among Z- 
scores of neighboring SNPs due to LD). Figure 2a shows that, indeed, the 
empirical distribution of Z-scores is close to standard normal (results are 
shown for phased panel; results for unphased panel are similar). Note that 
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a: Model allowing for overdlsperslon 



0.2 0.4 0.6 0.8 

Z-scores binned by standard normal percentiles 

b: Model without overdispersion 



0.0 0.2 0.4 0.6 0.8 1.0 

Z-scores binned by standard normal percentiles 

Fig. 2. Comparison of variance estimation in models with and without overdispersions. 
The Z-scores are binned according to the standard normal percentiles, for example, the first 
bin (0 to 0.05) contains Z-score values from — oo to —1.645. If the Z-scores are i.i.d. and 
strictly follow standard normal distribution, we expect all the bins to have approximately 
equal height. 



the overdispersion parameter plays a crucial role in achieving this calibra- 
tion. In particular, the Z-scores produced by the model without overdisper- 
sion (2.2) do not follow a standard normal distribution, with many more 
observations in the tails (Figure 2b) indicating that the variance is under- 
estimated. 

3.1.1. Comparison with unregularized linear frequency estimator. To as- 
sess the role of regularization, we compared the accuracy of BLIMP with 
simple unregularized linear frequency estimators based on a small number 
of nearby "predicting" SNPs. (In fact, selecting a small number of SNPs can 
be viewed as a kind of regularization, but we refer to it as unregularized for 
convenience.) The unregularized linear estimator has the same form as in 
(2.4), but uses the unregularized estimates f P anel and £ panel for (1 and S. We 
consider two schemes to select predictors: the first scheme selects k flanking 
SNPs on either side of the target SNP (so 2k predictors in total); the sec- 
ond scheme selects the 2k SNPs with the highest marginal correlation with 
the target SNP. Figure 3 shows RMSE as predicting SNPs increases from 
to 50. We find that the best performance of the unregularized methods 
is achieved by the first scheme, with a relatively large number of predict- 
ing SNPs (20-40); however, its RMSE is larger than that of IMPUTE and 
BLIMP. 
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10 20 30 40 50 

Number of Predicting SNPs 

Fig. 3. Comparison between BLIMP estimator and unregularized linear estimators. The 
lines show the RMSE of each allele frequency estimator vs. number of predicting SNPs. 
Results are shown for two schemes for selecting predicting SNPs: flanking SNPs (red line) 
and correlated SNPs (green line). Neither scheme is as accurate as BLIMP (blue solid 
line) or IMPUTE (blue dashed line). 



3.2. Individual-level genotype imputation. Although very satisfactory 
methods already exist for individual-level genotype imputation, BLIMP has 
the potential advantage of being extremely fast and low on memory-usage 
(see computational comparisons below). We therefore also assessed its per- 
formance for individual-level genotype imputation. We used the same cross- 
validation procedure as in frequency imputation, but using individual-level 
data as input. As above, we compared results from our approach with those 
obtained using IMPUTE vl. 

We again use RMSE to measure accuracy of imputed (posterior mean) 
genotypes: 



(3.3) 



RMSE: 



\ 



-. v m 



mp 



=1 i=l 



where m is the number of the individuals (1376), p is the total number of 
tested SNPs (4329) and g^, g^ are observed and estimated (posterior mean) 
genotypes for individual i at SNP j respectively. 

For comparison purposes, we also use a different measure of accuracy that 
is commonly used in this setting: the genotype error rate, which is the num- 
ber of wrongly imputed genotypes divide by the total number of imputed 
genotypes. To minimize the expected value of this metric, one should use the 
posterior mode genotype as the estimated genotype. Thus, for IMPUTE vl 
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Table 1 

Comparison of accuracy of BLIMP and IMPUTE for frequency and 
individual-level genotype imputations. The RMSE and error rate, defined in 
the text, provide different metrics for assessing accuracy; in all cases BLIMP 
was very slightly less accurate than IMPUTE. The "naive method" refers to 
the strategy of estimating the sample frequency of each untyped SNP by its 
observed frequency in the panel; this ignores information in the observed 
sample data, and provides a baseline level of accuracy against which the other 
methods can be compared 



RMSE Error rate 



Frequency imputation 

Naive method 0.0397 NA 

BLIMP (phased panel) 0.0157 NA 

BLIMP (unphased panel) 0.0159 NA 

IMPUTE 0.0154 NA 

Individual genotype imputation 

BLIMP (phased panel) 0.2339 6.46% 

BLIMP (unphased panel) 0.2407 6.77% 

IMPUTE 0.2303 6.30% 



we used the posterior mode genotype for this assessment with this metric. 
However, for simplicity, for our approach we used the posterior mean geno- 
type rounded to the nearest integer. (Obtaining posterior distributions on 
genotypes using our approach, as outlined in Appendix B, is considerably 
more complicated, and in fact produced slightly less accurate results, not 
shown.) 

We found that under either metric BLIMP provides only very slightly less 
accurate genotype imputations than IMPUTE (Table 1). Further, as before, 
replacing the phased panel with an unphased panel produces only a small 
decrease in accuracy (Table 1). 

These results show average accuracy when all untyped SNPs are imputed. 
However, it has been observed previously [e.g., Marchini et al. (2007), Guan 
and Stephens (2008)] that accuracy of calls at the most confident SNPs 
tends to be considerably higher than the average. We checked that this is 
also true for BLIMP. To obtain estimates of the confidence of imputations at 
each SNP, we first estimated a by maximum likelihood using the summary 
data across all individuals, and then computed the variance for each SNP 
using (2.12); note that this variance does not depend on the individual, 
only on the SNP. We then considered performing imputation only at SNPs 
whose variance was less than some threshold, plotting the proportion of 
SNPs imputed ("call rate") against their average genotype error rate as this 
threshold varies. The resulting curve for BLIMP is almost identical to the 
corresponding curve for IMPUTE (Figure 4). 
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Fig. 4. Controlling individual-level genotype imputation error rate on a per-SNP basis. 
For BLIMP, the error rate is controlled by thresholding on the estimated variance for 
imputed SNP frequencies; for IMPUTE the call threshold is determined by average max- 
imum posterior probability. These two different measures of per-SNP imputation quality 
are strongly correlated (correlation coefficient = — 0.983 ). 

3.3. Individual-level genotype imputation without a panel. We use the 
same WTCCC Birth Cohort data to assess our modified ECM algorithm 
for performing individual-level genotype imputation without using a panel. 
To create a data set with data missing at random, we mask each genotype, 
independently, with probability m. We create multiple data sets by varying 
m from 5% to 50%. For each data set we run our ECM algorithm for 20 
iterations. (Results using different starting points for the ECM algorithm 
were generally very consistent, and so results here are shown for a single 
starting point.) 

We compare the imputation accuracy with the software package BIMBAM 
[Guan and Stephens (2008)] which implements the algorithms from Scheet 
and Stephens (2005). 

BIMBAM requires the user to specify a number of "clusters" and other 
parameters related to the EM algorithm it uses: after experimenting with 
different settings we applied BIMBAM on each data set assuming 20 clusters, 
with 10 different EM starting points, performing 20 iterations for each EM 
run. (These settings produced more accurate results than shorter runs.) 

Overall, imputation accuracy of the two methods was similar (Table 2), 
with BLIMP being slightly more accurate with larger amounts of missing 
data and BIMBAM being slightly more accurate for smaller amounts of 
missing data. 

We note that, in this setting, some of the key computational advantages of 
our method are lost. In particular, when each individual is missing genotypes 
at different SNPs, one must effectively invert a different covariance matrix 
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Table 2 

Comparison of imputation error rates from BLIMP 
and BIMBAM for individual genotype imputation 
without a panel 







Missing rate 






5% 


10% 


20% 


50% 


BIMBAM 


5.79% 


6.35% 


7.15% 


9.95% 


BLIMP ECM 


6.07% 


6.49% 


7.31% 


9.91% 



for each individual. Furthermore, this inversion has to be performed multiple 
times, due to the iterative scheme. For small amounts of missing data, the 
results from BLIMP we present here took less time than the results for 
BIMBAM, but for larger amounts the run times are similar. 

3.4. Noise reduction in pooled experiment. We used simulations to as- 
sess the potential for our approach to improve allele frequency estimates 
from noisy data in DNA pooling experiments [equation (2.13)]. To gener- 
ate noisy observed data, we took allele frequencies of 4329 genotyped SNPs 
from the WTCCC Birth Cohort chromosome 22 data as true values, and 
added independent and identically distributed iV(0,£ 2 ) noise terms to each 
true allele frequency. Real pooling data will have additional features not 
captured by these simple simulations (e.g., biases toward one of the alle- 
les), but our aim here is simply to illustrate the potential for methods like 
ours to reduce noise in this type of setting. We varied e from 0.01 to 0.18 
to examine different noise levels. Actual noise levels in pooling experiments 
will depend on technology and experimental protocol; to give a concrete 
example, Meaburn et al. (2006) found differences between allele frequency 
estimates from pooled genotyping and individual genotyping of the same 
individuals, at 26 SNPs, in the range 0.008-0.077 (mean 0.036). 

We applied our method to the simulated data by first estimating a and 
e using (2.12), and then, conditional on these estimated parameters, esti- 
mating the allele frequency at each observed SNP using the posterior mean 
given in equation (2.13). We assessed the accuracy (RMSE) of these allele 
frequency estimates by comparing them with the known true values. 

We found that our method was able to reliably estimate the amount of 
noise present in the data: the estimated values for the error parameter e 
show good correspondence with the standard deviation used to simulate the 
data (Figure 5a), although for high noise levels we underestimate the noise 
because some of the errors are absorbed by the parameter a. 

More importantly, we found our estimated allele frequency estimates were 
consistently more accurate than the direct (noisy) observations, with the 
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Fig. 5. (a,) Detection of experimental noise in simulated data. The simulated data sets 
are generated by adding Gaussian noise N(0,e 2 ) to the actual observed WTCCC frequen- 
cies. The estimated e values are plotted against the true e values used for simulation. We 
estimate e using maximum likelihood by (2.11). (b) An illustration on the effect of noise 
reduction in various noise levels. RMSE from noise reduced estimates are plotted against 
RMSE from direct noisy observations. The noise reduced frequency estimates are posterior 
means obtained from model (2.13). 



improvement being greatest for higher noise levels (Figure 5b). For example, 
with e = 0.05 our method reduced the RMSE by more than half, to 0.024. 

3.5. Computational efficiency. Imputation using our implementation of 
BLIMP is highly computationally efficient. The computational efficiency is 
especially notable when dealing with large panels: the panel is used only to 
estimate fi and S, which is both quick and done just once, after which impu- 
tation computations do not depend on panel size. Our implementations also 
take advantage of the sparsity of £ to increase running speed and reduce 
memory usage. To give a concrete indication of running speed, we applied 
BLIMP to the WTCCC Birth Cohort data on chromosome 22, containing 
4329 genotyped and 29,697 untyped Hapmap SNPs on 1376 individuals, us- 
ing a Linux system with eight-core Intel Xeon 2.66 GHz processors (although 
only one processor is used by our implementation). The running time is mea- 
sured by 'real' time reported by the Unix "time" command. For frequency 
imputation, BLIMP took 9 minutes and 34 seconds, with peak memory us- 
age of 162 megabytes; for individual-level genotype imputation, BLIMP took 
25 minutes, using under 300 megabytes of memory. As a comparison, IM- 
PUTE vl took 195 minutes for individual-level genotype imputation, with 
memory usage exceeding 5.1 gigabytes. 

Since these comparisons were done, we note that a new version of IM- 
PUTE (v2) has been released [Howie, Donnelly and Marchini (2009)]. This 



18 



X. WEN AND M. STEPHENS 



new version gives similar imputation accuracy in the settings we described 
above; it runs more slowly than vl but requires less memory. 

4. Conclusion and discussion. Imputation has recently emerged as an 
important and powerful tool in genetic association studies. In this paper we 
propose a set of statistical tools that help solve the following problems: 

1. Imputation of allele frequencies at untyped SNPs when only summary- 
level data are available at typed SNPs. 

2. Noise reduction for estimating allele frequencies from DNA pooling-based 
experiments. 

3. Fast and accurate individual-level genotype imputation. 

The proposed methods are simple, yet statistically elegant, and computa- 
tionally extremely efficient. For individual-level genotype imputation the 
imputed genotypes from this approach are only very slightly less accurate 
than state-of-the-art methods. When only summary-level data are available, 
we found that imputed allele frequencies were almost as accurate as when 
using full individual genotype data. 

The linear predictor approach to imputation requires only an estimate 
of the mean and the covariance matrix among SNPs. Our approach to ob- 
taining these estimates is based on the conditional distribution from Li and 
Stephens (2003); however, it would certainly be possible to consider other es- 
timates, and, specifically, to use other approaches to shrink the off-diagonal 
terms in the covariance matrix. An alternative, closely-related, approach is 
to obtain a linear predictor directly by training a linear regression to pre- 
dict each SNP, using the panel as a training set, and employing some kind of 
regularization scheme to solve potential problems with overfitting caused by 
large p small n. This approach has been used in the context of individual- 
level genotype imputation by A. Clark (personal communication) and Yu 
and Schaid (2007). However, the choice of appropriate regularization is not 
necessarily straightforward, and different regularization schemes can provide 
different results [Yu and Schaid (2007)]. Our approach of regularizing the 
covariance matrix using the conditional distribution from Li and Stephens 
(2003) has the appeal that this conditional distribution has already been 
shown to be very effective for individual genotype imputation, and for mod- 
eling patterns of correlation among SNPs more generally [Li and Stephens 
(2003), Stephens and Scheet (2005), Servin and Stephens (2008), Marchini 
et al. (2007)]. Furthermore, the fact that, empirically, BLIMP's accuracy is 
almost as good as the best available purpose-built methods for this problem 
suggests that alternative approaches to regularization are unlikely to yield 
considerable improvements in accuracy. 

The accuracy with which linear combinations of typed SNPs can pre- 
dict untyped SNPs is perhaps somewhat surprising. That said, theoretical 
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arguments for the use of linear combinations have been given in previous 
work. For example, Clayton, Chapman and Cooper (2004) showed by ex- 
ample that, when SNP data are consistent with no recombination (as might 
be the case for markers very close together on the genome), each SNP can 
be written as a linear regression on the other SNPs. Conversely, it is easy 
to construct hypothetical examples where linear predictors would fail badly. 
For example, consider the following example from Nicolae (2006a): 3 SNPs 
form 4 haplotypes, 111, 001, 100 and 010, each at frequency 0.25 in a pop- 
ulation. Here the correlation between every pair of SNPs is 0, but knowing 
any 2 SNPs is sufficient to predict the third SNP precisely. Linear predictors 
cannot capture this "higher order" interaction information, so produce sub- 
optimal imputations in this situation. In contrast, other methods (including 
IMPUTE) could use the higher-order information to produce perfect impu- 
tations. The fact that, empirically, the linear predictor works well suggests 
that this kind of situation is rare in real human population genotype data. 
Indeed, this is not so surprising when one considers that, from population 
genetics theory SNPs tend to be uncorrelated only when there is sufficiently 
high recombination rate between them, and recombination will tend to break 
down any higher-order correlations as well as pairwise correlations. 

Besides HMM-based methods, another type of approach to genotype im- 
putation that has been proposed is to use "multi-marker" tagging [de Bakker 
et al. (2005), Purcell et al. (2007), Nicolae (2006b)]. A common feature of 
these methods is to preselect a (relatively small) subset of "tagging" SNPs 
or haplotypes from all typed SNPs based on some LD measure threshold, 
and then use a possibly nonlinear approach to predicting untyped SNPs 
from this subset. Thus, compared with our approach, these methods gener- 
ally use a more complex prediction method based on a smaller number of 
SNPs. Although we have not compared directly with these methods here, 
published comparisons [Howie, Donnelly and Marchini (2009)] suggest that 
they are generally noticeably less accurate than HMM-based methods like 
IMPUTE, and, thus, by implication, less accurate than BLIMP. That is, 
it seems from these results that, in terms of average accuracy, it is more 
important to make effective use of low-order correlations from all available 
SNPs that are correlated with the target untyped SNP than to take account 
of unusual higher-order correlations that may occasionally exist. 

Our focus here has been on the accuracy with which untyped SNP al- 
lele frequencies can be imputed. In practice, an important application of 
these imputation methods is to test untyped alleles for association with an 
outcome variable (e.g., case-control status). Because our allele frequency 
predictors are linear combinations of typed SNP frequencies, each test of 
an untyped SNP is essentially a test for differences between a given linear 
combination of typed SNPs in case and control groups. Several approaches 
to this are possible; for example, the approach in Nicolae (2006b) could be 
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readily applied in this setting. The resulting test would be similar to the test 
suggested in Homer et al. (2008a) which also uses a linear combination of 
allele frequencies at typed SNPs to test untyped SNPs for association with 
case-control status in a pooling context. The main difference is that their 
proposed linear combinations are ad hoc, rather than being chosen to be the 
best linear imputations; as such, we expect that appropriate use of our linear 
imputations should result in more powerful tests, although a demonstration 
of this lies outside the scope of this paper. 

APPENDIX A: LEARNING FROM PANEL USING THE 
LI AND STEPHENS MODEL 

In this section we show the calculation of fi and S using phased population 
panel data. Following the Li and Stephens model, we assume that there are 
K template haplotypes in the panel. For a new haplotype h sampled from 
the same population, it can be modeled as an imperfect mosaic of existing 
template haplotypes in the panel. Letting e,- denote the jth unit vector in K 
dimensions (1 in the jth coordinate, O's elsewhere), we define random vector 
Z t to be ej if haplotype h at locus t copies from jth template haplotype. 
The model also assumes Zi,...,Z n form a Markov chain in state space 
{ei, . . . , e^} with transition probabilities 

(A.l) Pr(Z t = e m |Z*_i = e n , M) = (1 - r t )l [em=en] + r t e' m a, 

where ot = ^l and rt = 1 — exp(—pt/K) is a parameter that controls the 
probability that h switches copying template at locus t. The initial-state 
probabilities of the Markov chain are given by 

(A.2) tt(Zi = e fc |M) = — for k = 1, . . . ,K. 

K 

It is easy to check that the initial distribution ir is also the stationary distri- 
bution of the described Markov chain. Because the chain is initiated at the 
stationary state, it follows that, conditional on M, 

(A.3) Zi = d Z 2 = d ■ ■ ■ = d Z p = d it. 

Therefore, marginally, the means and variances of Zt's have the following 
simple forms: 

(A.4) E(Zi|M) = --- = E(Z p |M) = a, 

(A.5) Var(Zi|M) = • • • = Var(Z„|M) = diag(a) - aa'. 

Let if-dimensional vector qP dnel denote the binary allelic state of panel hap- 
lotypes at locus t and scalar parameter 6 represents mutation. The emission 
distribution in the Li and Stephens model is given by 

(A.6) Pr(^ = l|Z t = e fc , M) = (1 - e)e' k <$ anel + \0, 
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that is, with probability 1 — 9, h perfectly copies from the A:th template in 
the panel at locus t, while, with probability 9, a mutation occurs and ht 
"mutates" to allele or 1 equally likely. If we define pt = (1 — 9)q^ a,ncl + 1 1, 
then the emission distribution can be written as 

(A.7) Pr(h t = l\Z t ,M) = E(h t \Z t ,M) = P%- 

The goal here is to find the closed-form representations of the first two 
moments of joint distribution (hi, hi, ■ ■ ■ , h p ) given the observed template 
panel M. For marginal mean and variance of ht, it follows that 

E(fc t |M)=E(E(/i t |Z t ,M)|M) 

(A.8) =p£E(Zt|M) 

= (i-0)./r cl +^ 

(A.9) Var^lM) = (1 - 9ffr c \l - /f anel ) + °- (l - |) , 

where J t panel = q£ . cx is the observed allele frequency at locus t from panel 
M. Finally, to compute Cov(h s ,ht) for some loci s < t, we notice that, con- 
ditional on Z a and M, h s and ht are independent and 

E(h s ■ h t \M) = E(E(h s ■ ht\Z s ,M)\M) 

(A.10) 

= E(E(/i s |Z s ,M) • E(/i*|Z s ,M)|M). 

Let r s t denote the switching probability between s and t, and E(/i(|Z s ,M) 
can be calculated from 

E(ht\Z s , M) = E(E(ht\Z t , Z s , M)\Z S , M) 

(A.ll) =E(Z; Pt |Z s ,M) 

= ((l-r s t)Z' s +r st a')pt. 

Therefore, 

E(h s ■ ht\M) = p' s E(Z s ((l - r st )Z' s +r st a')\M)pt 
(A.12) = (1 - r s t)p^Var(Z s ) Pt + p' s aa'p t 

= (1 - r st ) • P ' s (diag(a) - aa')p* + E(h s \M)E(h t |M). 

It follows that 

(A.13) Cov(h s ,h t \M) = (1 - 9) 2 (1 - r st )(f!r l - iT"'/^ 1 ), 

where /f t anel is the panel frequency of the haplotype "1-1" consisting of loci 
s and t. 
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In conclusion, under the Li and Stephens model, the distribution h|M 
has expectation 

(A.14) /i = E(h|M) = (l-#)f pancl + ^l, 

where f P anel is the p- vector of observed frequencies of all p SNPs in the 
panel, and variance 

(A.15) t = Var(hjM) = (1 - 9) 2 S + 

where matrix S has the structure 

{f panel/-. _ f paneK • _ • 

(i-^)(/r el -/r nel /j anei ), 



APPENDIX B: DERIVATION OF JOINT GENOTYPE FREQUENCY 

DISTRIBUTION 

In this section we derive the joint genotype frequency distribution based 
on the Li and Stephens model. 

Let gu denote the genotype of individual % at locus t. The sample fre- 
quency of genotype at locus t is given by 

1 n 

(B.1) Pr( 5t = 0)=p^ = -]Tl fe=0] . 

i=i 

Similarly, genotype frequencies pf = Pr(<ft = 1) and p^ = Pr(g< = 2) can 
be obtained by averaging indicators l[ Sii =i] and l[ 9it =2] over the samples 
respectively. Because of the restriction 

(B-2) l [gu=0] +l [gu=1] +l [gu=2] = l, 

given any two of the three indicators, the third one is uniquely determined. 
Let & denote 2p-vector (% 1=0 ], % 1= 2], • • • . % P =o]> 1 [ Sip =2]) and 

n 

,8 



1 - 

(b.3) y, = = 



n 
i=i 



Assuming that gi, . . . , g n are i.i.d. draws from conditional distribution Pr(g|M), 
by the central limit theorem as sample size n is large, it follows 

(B.4) y 9 |M~N 2p Q^,E s ), 

where fi g = E(g|M) and £ 9 = Var(g|M). 

For the remaining part of this section, we derive the closed- form expres- 
sions for fi g and S g based on the Li and Stephens model. 
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Let h a and h b denote the two composing haplotypes for some genotype 
sampled from the population. Note that 

± [gt=0] = (l-h?)(l-h b ), 

(B.5) 



1 



[gt=2] 



Given panel M, the two composing haplotypes are also assumed to be in- 
dependent and identically distributed. Following the results from Appendix 
A, we obtain that 



E (%=o] 

E (%=2] 

(B.6) Var(l [fl(=0] 
Var(l [9t=2] 
Cov(l [gt=0 ],l [gt=2] 



M) = (l-E(/ it |M)) i , 
M) = E(/i t |M) 2 , 

M) = (1 - E(/i t |M)) 2 • (1 - (1 - E(/i t |M)) 2 ), 
M) = E(/i t |M) 2 -(l-E(/i t |M) 2 ), 
M) = -(l-E(/ lt |M)) 2 -E(/ l t|M) 2 , 



where E(/i t |M) is given by (A. 8). 

To compute covariance across different loci s and t, we note that 

%=o]%=o] = (1 - K)(l - h b s )(l - h a t ){\ - h b t ), 



(B.7) 



l [gs=2] t [gt=2] = h a s h b s h?h b 



t ■ 



1 [g s =0] 1 \gt=2] 



(l- K )(l-h b s )h a t h\, 



1 



[ 9s =2]l[ 9t =0] 



Kh b s {\-h a t ){i-K). 



Then all the covariance terms across different loci can be represented using 
E(/i s /itjM), which is given by (A. 12). For example, 



(B. 



Cov(l [Ss=2]) l [gt=2] |M) 

= Cov(/t s , h t \M) 2 + 2E(/i s |M) • E(/t t |M) • Cov(/i s , h t \M). 



APPENDIX C: MODIFIED ECM ALGORITHM FOR IMPUTING 

GENOTYPES WITHOUT A PANEL 

In this section we show our modified ECM algorithm for genotype impu- 
tation without a panel. 

By our assumption, each individual genotype p-vector g* is a random 
sample from the multivariate normal distribution N p (/i,S), and different 
individual vectors may have different missing entries. Suppose we have n 
individual samples, then let G Q b s denote the set of all typed genotypes across 
all individuals and g^ bs denote the typed genotypes for individual i. 
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In the E step of the ECM algorithm, we compute the expected values 
of the sufficient statistics Y17=i 9j f° r 3 ' = 1> • ■ • >P an< ^ Ya=i 9j9k ^ or J> ^ = 
1, . . . ,p conditional on G Q b s and the current estimate for (fi, £). Specifically, 
in the ith iteration, 



(c.i) e X>j|G bs,A (<) ,£ (t) )=X>: 



*,(*) 

3 ' 



i=l 



e(J># G obs ,A^,^)=E(4 (t) ^+4f)' 

i=l 



vi=l 



.(*) _ J ^ 



if is typed, 



E(<4|g obs , p,®,£®), if g) is untyped, 



(C.2) 

where 
(C.3) 
and 

(C.4) 

The calculation of E(sj|g obs , /i (<) , £(*)) and Cov(^,^|g* bs , p,® ,W) follows 
directly from (2.4). 

In the conditional maximization step, we first update the estimates for 

f panel and jj panel sequent ially, that is, 



0. 



1 Cov(g*-,^|g obs ,/i^,SW), if g % - and ^ are both untyped. 



if g) or is typed, 



(C 5) jpanel,(t+l) 



and 



(C.6) 



-,pancl.(t+l) 



\i=i 



for j = l,...,p, 



G obs ,A (i) ,s (t) 



npanel,(t+l) ,.pancl,(t+l) 

h h 



for j,k = l,...,p. 
Finally, we update the shrinkage estimates fi and S using 
(C.7) 



A (t+l) = ( 1 _ e)fP anel,(t+l) + ^ 1) 



(C.8) 
where 

(C.9) 



s(*+i) = (l-0)25(*+ 1 ) + ^l-0/, 



S 



(t+i) 



r-<panel,(t+l) 



cxp 



Pjrfc \ v panel,(t+l) 

'2nJ 



i = k, 

ij ^ k. 
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We initiated the ECM algorithm by setting fP ancl >(°) to the marginal 
means from all observed data and £ panel, W to a diagonal matrix with diag- 
onal entries being empirical variance computed from typed SNPs. 

Acknowledgments. We thank Yongtao Guan and Bryan Howie for help- 
ful discussions. 

REFERENCES 

Browning, S. and Browning, B. (2007). Rapid and accurate haplotype phasing and 

missing data inference for whole genome association studies using localized haplotype 

clustering. American Journal of Human Genetics 81 1084-1097. 
Clayton, D., Chapman, J. and Cooper, J. (2004). Use of unphased multilocus genotype 

data in indirect association studies. Genetic Epidemiology 27 415-428. 
de Barker, P., Yelensky, R., Pe'er, I., Gabriel, S., Daly, M. and Altshuler, D. 

(2005). Efficiency and power in genetic association studies. Nature Genetics 37 1217- 

1223. 

Guan, Y. and Stephens, M. (2008). Practical issues in imputation-based association 

mapping. PLoS Genetics 4 el000279. 
Homer, N., Tembe, W., Szelinger, S., Redman, M., Stephan, D., Pearson, J., 

Nelson, D. and Craig, D. (2008a). Multimarker analysis and imputation of multiple 

platform pooling-based genome-wide association studies. Bioinformatics 24 1896-1902. 
Homer, N., Szelinger, S., Redman, M., Duggan, D., Tembe, W., Muehling, J., 

Pearson, J., Stephan, D., Nelson, S. and Craig, D. (2008b). Resolving individuals 

contributing trace amounts of dna to highly complex mixtures using high-density snp 

genotyping microarrays. PLoS Genetics 4 el000167. 
Howie, B., Donnelly, P. and Marchini, J. (2009). A flexible and accurate genotype 

imputation method for the next generation of genome-wide association studies. PLoS 

Genetics 5 el000529. 

Huang, L., Li, Y., Singleton, A., Hardy, J., Abecasis, G., Rosenberg, N. and 
Scheet, P. (2009). Genotype-imputation accuracy across worldwide human popula- 
tions. American Journal of Human Genetics 84 235-250. 

Hudson, R. (2001). Two-locus sampling distributions and their application. Genetics 159 
1805-1817. 

Li, N. and Stephens, M. (2003). Modelling linkage disequilibrium and identifying recom- 
bination hotspots using snp data. Genetics 165 2213-2233. 
Li, Y., Ding, J. and Abecasis, G. (2006). Mach 1.0: Rapid haplotype reconstruction 

and missing genotype inference. American Journal of Human Genetics 79 S2290. 
Marchini, J., Howie, B., Myers, S., McVean, G. and Donnelly, P. (2007). A new 

multipoint method for genome-wide association studies by imputation of genotypes. 

Nature Genetics 39 906-913. 
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, 2nd ed. Chapman 

and Hall, London. MR0727836 
McVean, G., Awadalla, P. and Fearnhead, P. (2002). A coalescent- based method for 

detecting and estimating recombination from gene sequences. Genetics 160 1231-1241. 
Meaburn, E., Butcher, L., Schalkwyk, L. and Plomin, R. (2006). Genotyping pooled 

DNA using 100k snp microarrays: A step towards genomewide association scans. Nucleic 

Acids Research 34 e28. 
Meng, X. and Rubin, D. (1993). Maximum likelihood estimation via the ECM algorithm: 

A general framework. Biometrika 80 267-278. MR1243503 



26 



X. WEN AND M. STEPHENS 



NlCOLAE, D. (2006a). Quantifying the amount of missing information in genetic associa- 
tion studies. Genetic Epidemiology 30 703-717. 

NlCOLAE, D. (2006b). Testing untyped alleles (tuna)-applications to genome- wide associ- 
ation studies. Genetic Epidemiology 30 718-727. 

Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M., Bender, D., 
Maller, J., Sklar, P., de Bakker, P., Daly, M. and Sham, P. (2007). Plink: A 
toolset for whole-genome association and population-based linkage analysis. American 
Journal of Human Genetics 81 559-575. 

Sankararaman, S., Obozinski, G., Jordan, M. and Halperin, E. (2009). Genomic 
privacy and limits of individual detection in a pool. Nature Genetics 41 965-967. Epub. 

Scheet, P. and Stephens, M. (2005). A fast and flexible statistical model for large-scale 
population genotype data: Applications to inferring missing genotypes and haplotype 
phase. American Journal of Human Genetics 78 629-644. 

Servin, B. and Stephens, M. (2008). Imputation-based analysis of association studies: 
Candidate regions and quantitative traits. PLoS Genetics 3 el 14. 

Stephens, M. and Scheet, P. (2005). Accounting for decay of linkage disequilibrium in 
haplotype inference and missing-data imputation. American Journal of Human Genet- 
ics 76 449-462. 

Stephens, M., Smith, N. and Donnelly, P. (2001). A new statistical method for hap- 
lotype reconstruction from population data. American Journal of Human Genetics 68 
978-989. 

The International HapMap Consortium (2005). A haplotype map of the human 
genome. Nature 437 1299-1320. 

Weir, B. (1979). Inferences about linkage disequilibrium. Biometrics 35 235-254. 

Wellcome Trust Case Control Consortium (2007). Genome- wide association study 
of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447 661-678. 

West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models, 2nd ed. 
Springer, New York. MR1482232 

Yu, Z. and Schaid, D. (2007). Methods to impute missing genotypes for population data. 
Human Genetics 122 495-504. 

Zeggini, E., Scott, L., Saxena, R., Voight, B., Marchini, J., Hu, T., de Barker, 
P., Abecasis, G., Almgren, P., Andersen, G., Ardlie, K., Bostrom, K., 
Bergman, R., Bonnycastle, L., Borch-Johnsen, K., Burtt, N., Chen, H., 
Chines, P., Daly, M., Deodhar, P., Ding, C, Doney, A., Duren, W., El- 
liott, K., Erdos, M., Frayling, T., Freathy, R., Gianniny, L., Grallert, H., 
Grarup, N., Groves, C, Guiducci, C, Hansen, T., Herder, C, Hitman, C, 
Hughes, T., Isomaa, B., Jackson, A., Jorgensen, T., Kong, A., Kubalanza, 
K., Kuruvilla, F., Kuusisto, J., Langenberg, C, Lango, H., Lauritzen, T., 
Li, Y., Lindgren, C, Lyssenko, V., Marvelle, A., Meisinger, C, Midthjell, 
K., Mohlke, K., Morken, M., Morris, A., Narisu, N., Nilsson, P., Owen, K., 
Palmer, C, Payne, F., Perry, J., Pettersen, E., Platou, C, Prokopenko, 
I., Qi, L., Qin, L., Rayner, N., Rees, M., Roix, J., Sandbaek, A., Shields, B., 
Sjogren, M., Steinthorsdottir, V., Stringham, H., Swift, A., Thorleifsson, 
G., Thorsteinsdottir, U., Timpson, N., Tuomi, T., Tuomilehto, J., Walker, 
M., Watanabe, R., Weedon, M., CJ, C. W., Wellcome Trust Case Control 
Consortium, Illig, T., Hveem, K., Hu, F., Laakso, M., Stefansson, K., Ped- 
ersen, O., Wareham, N., Barroso, I., Hattersley, A., Collins, F., Groop, L., 
McCarthy, M., Boehnke, M. and Altshuler, D. (2008). Meta-analysis of genome- 
wide association data and large-scale replication identifies additional susceptibility loci 
for type 2 diabetes. Nature Genetics 40 638-645. 



IMPUTING ALLELE FREQUENCIES USING LINEAR PREDICTORS 



27 



Department of Statistics 
University of Chicago 
Chicago, Illinois 60637 
USA 

E-MAIL: wen@uchicago.cdu 



Department of Statistics 
and 

Department of Human Genetics 
University of Chicago 
Chicago, Illinois 60637 
USA 

E-MAIL: mstephens@uchicago.cdu 



